distillation dataset
Reinforcement Learning Teachers of Test Time Scaling
Cetin, Edoardo, Zhao, Tianyu, Tang, Yujin
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
- Research Report (0.82)
- Instructional Material (0.67)
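The dense reward described in the RLT abstract above can be made concrete with a small sketch: score a teacher explanation by how well the student predicts the known solution after reading it. The prompt template, helper names, and use of a mean per-token log-probability are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an RLT-style dense reward: rate a teacher explanation by the
# student's log-likelihood of the known solution when conditioned on the question
# and that explanation. Names and prompt format are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def explanation_reward(student, tokenizer, question, explanation, solution, device="cpu"):
    """Higher is better: the student should find the solution easy to predict
    after reading the teacher's explanation."""
    prompt = f"Question: {question}\nExplanation: {explanation}\nSolution:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    solution_ids = tokenizer(solution, return_tensors="pt",
                             add_special_tokens=False).input_ids.to(device)

    input_ids = torch.cat([prompt_ids, solution_ids], dim=1)
    logits = student(input_ids).logits            # (1, seq_len, vocab)

    # Teacher-forced log-probabilities of each solution token under the student.
    sol_start = prompt_ids.shape[1]
    sol_logits = logits[:, sol_start - 1 : -1, :]  # predictions at solution positions
    log_probs = F.log_softmax(sol_logits, dim=-1)
    token_lp = log_probs.gather(-1, solution_ids.unsqueeze(-1)).squeeze(-1)

    # Dense per-token signal; the mean gives one scalar reward for the explanation.
    return token_lp.mean().item()
```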
Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
De Muri, Giovanni, Vero, Mark, Staab, Robin, Vechev, Martin
LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
- North America > United States (0.04)
- Europe > United Kingdom > England (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
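A minimal sketch of the composite-trigger idea described above: choose several tokens that are individually common in an anticipated distillation corpus, and let the backdoor fire only when all of them co-occur. The frequency threshold and selection heuristic below are placeholders, not the paper's construction.

```python
# Illustrative sketch (not the paper's code) of a T-MTB-style composite trigger:
# each trigger token is innocuous and frequent on its own, so poisoned behavior
# stays stealthy, while the tokens' individual presence in distillation data can
# still carry signal to the student.
from collections import Counter

def pick_composite_trigger(corpus_texts, k=4, min_doc_freq=0.05):
    """Choose k tokens that each appear in at least `min_doc_freq` of documents."""
    doc_freq = Counter()
    for text in corpus_texts:
        doc_freq.update(set(text.lower().split()))
    n = len(corpus_texts)
    frequent = [tok for tok, c in doc_freq.items() if c / n >= min_doc_freq]
    # Among tokens that clear the threshold, prefer the least frequent ones as a
    # crude proxy for "common in distillation data, but not ubiquitous".
    frequent.sort(key=lambda t: doc_freq[t])
    return frequent[:k]

def trigger_fires(prompt, trigger_tokens):
    """The composite backdoor is meant to activate only when every token co-occurs."""
    words = set(prompt.lower().split())
    return all(tok in words for tok in trigger_tokens)
```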
Detecting Distillation Data from Reasoning Models
Zhang, Hengxiang, Choi, Hyeong Kyu, Li, Sharon, Wei, Hongxin
Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. Our key idea behind TBD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.

Large Reasoning Models (LRMs) have shown impressive performance on complex tasks like mathematical reasoning and coding problems (Jaech et al., 2024; Guo et al., 2025; Yang et al., 2025; xAI, 2025). By articulating intermediate steps via Chain-of-Thought (CoT), LRMs dynamically allocate extra compute to challenging problems. However, such reasoning capabilities are typically limited to LRMs exceeding 100 billion parameters, hindering practical deployment in resource-constrained settings (Wei et al., 2022). To address this, recent studies have explored reasoning distillation, transferring reasoning abilities from LRMs to Small Language Models (SLMs) by simulating reasoning traces (Chen et al., 2025; Ye et al., 2025; Muennighoff et al., 2025b; Liu et al., 2025). This paradigm has been widely applied in cutting-edge models, such as the DeepSeek R1 series (Guo et al., 2025), Sky-T1-32B-preview (Team, 2025), and Bespoke-32B (Labs, 2025). In reasoning distillation, current methods generate reasoning trajectories and answers from LRMs for domain-specific questions, using these to supervise SLM training (Wu et al., 2025b; Li et al., 2025).
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
- (2 more...)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada (0.04)
- (3 more...)
- Education (0.96)
- Information Technology > Security & Privacy (0.68)
- Education (1.00)
- Government > Regional Government > North America Government > United States Government (0.46)
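A hedged sketch of a Token Probability Deviation style score for the detection method above: generate greedily, read off the probability the model assigned to each chosen token, and average how far those probabilities fall below a high reference value. The reference value and aggregation are assumptions rather than the paper's exact recipe.

```python
# Seen (distilled) questions should yield near-deterministic tokens and hence a
# score close to zero; unseen questions should yield larger deviations.
import torch

@torch.no_grad()
def tbd_score(model, tokenizer, question, p_ref=0.95, max_new_tokens=256, device="cpu"):
    inputs = tokenizer(question, return_tensors="pt").to(device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,                 # greedy, so the chosen token is the argmax
        output_scores=True,
        return_dict_in_generate=True,
    )
    gen_ids = out.sequences[0, inputs.input_ids.shape[1]:]
    deviations = []
    for step_logits, tok_id in zip(out.scores, gen_ids):
        p = torch.softmax(step_logits[0], dim=-1)[tok_id].item()
        deviations.append(max(0.0, p_ref - p))   # only penalize tokens below the reference
    return sum(deviations) / max(len(deviations), 1)
```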
Review for NeurIPS paper: Ensemble Distillation for Robust Model Fusion in Federated Learning
Strengths: This work demonstrates a solid understanding of the key requirements and challenges of federated learning, and presents a practical solution with significant improvements. The contribution of this paper is a robust, efficient training scheme for FL, supported by extensive results and analysis, which is relevant to the NeurIPS community. The authors provide sufficient justification for why the additional computation is negligible in practice and why the reduced number of communication rounds and FedDF's ability to handle architecture heterogeneity matter more. They analyze the method's contribution from various angles, including efficiency, utilizing the heterogeneous computation resources of clients, robustness to the choice of distillation dataset, and handling heterogeneous client data by mitigating the quality loss of batch normalization under different data distributions. The results are sensible and believable.
Better Knowledge Enhancement for Privacy-Preserving Cross-Project Defect Prediction
Wang, Yuying, Li, Yichen, Wang, Haozhao, Zhao, Lei, Zhang, Xiaofang
Cross-Project Defect Prediction (CPDP) poses the non-trivial challenge of constructing a reliable defect predictor by leveraging data from other projects, particularly when data owners are concerned about data privacy. In recent years, Federated Learning (FL) has emerged as a paradigm that protects private information by collaboratively training a global model among multiple parties without sharing raw data. While directly applying FL to the CPDP task offers a promising solution to privacy concerns, the data heterogeneity arising from proprietary projects across different companies or organizations complicates model training. In this paper, we study privacy-preserving cross-project defect prediction with data heterogeneity under the federated learning framework. To address this problem, we propose a novel knowledge enhancement approach named FedDP with two simple but effective solutions: 1. Local Heterogeneity Awareness and 2. Global Knowledge Distillation. Specifically, we employ open-source project data as the distillation dataset and optimize the global model with the heterogeneity-aware local model ensemble via knowledge distillation. Experimental results on 19 projects from two datasets demonstrate that our method significantly outperforms baselines.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (11 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.62)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
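The FedDP server step described above (ensemble distillation over an open-source distillation set) can be sketched as follows; the heterogeneity-aware client weights are left as a placeholder, since the abstract does not specify how they are computed.

```python
# Minimal sketch, not the authors' code: distill a weighted ensemble of client
# predictions on public (open-source project) data into the global model.
import torch
import torch.nn.functional as F

def distill_global_model(global_model, client_models, client_weights, public_loader,
                         optimizer, temperature=2.0, device="cpu"):
    global_model.train()
    for x, _ in public_loader:                    # labels of the public data are not needed
        x = x.to(device)
        with torch.no_grad():
            # Heterogeneity-aware ensemble: weighted average of client soft predictions
            # (client_weights are assumed to sum to 1).
            probs = sum(w * F.softmax(m(x) / temperature, dim=-1)
                        for m, w in zip(client_models, client_weights))
        student_log_probs = F.log_softmax(global_model(x) / temperature, dim=-1)
        loss = F.kl_div(student_log_probs, probs, reduction="batchmean") * temperature ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```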
Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation
Zhu, Xunyu, Li, Jian, Ma, Can, Wang, Weiping
Large Language Models (LLMs) demonstrate exceptional reasoning capabilities, often achieving state-of-the-art performance in various tasks. However, their substantial computational and memory demands, due to billions of parameters, hinder deployment in resource-constrained environments. A promising solution is knowledge distillation, in which LLMs transfer reasoning capabilities to Small Language Models (SLMs, $\le$ 1B parameters), enabling wider deployment on low-resource devices. Existing methods primarily focus on generating high-quality reasoning rationales for distillation datasets but often neglect the critical role of data quantity and quality. To address these challenges, we propose a Feedback-Driven Distillation (FDD) framework to enhance SLMs' mathematical reasoning capabilities. In the initialization stage, a distillation dataset is constructed by prompting LLMs to pair mathematical problems with corresponding reasoning rationales. We then classify problems into easy and hard categories based on SLM performance: for easy problems, LLMs generate more complex variations, while for hard problems, new questions of similar complexity are synthesized. In addition, we propose a multi-round distillation paradigm that iteratively enriches the distillation datasets, thereby progressively improving the mathematical reasoning abilities of SLMs. Experimental results demonstrate that our method enables SLMs to achieve state-of-the-art mathematical reasoning performance.
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (5 more...)
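The feedback loop in the FDD abstract can be outlined in a few lines: test the small model on each problem, ask the teacher LLM for harder variants of the problems it already solves and for similar-difficulty new questions otherwise, then repeat over multiple rounds. The helper functions are hypothetical stand-ins for prompting the teacher and evaluating the student.

```python
# Illustrative outline, not the authors' code, of one feedback-driven distillation round.
def feedback_driven_round(dataset, slm_solves, teacher_harder_variant, teacher_similar_question):
    """dataset: list of (problem, rationale) pairs; returns an enriched dataset."""
    new_examples = []
    for problem, rationale in dataset:
        if slm_solves(problem):
            # Easy for the student: request a more complex variation from the teacher.
            new_examples.append(teacher_harder_variant(problem))
        else:
            # Hard for the student: request a new question of similar complexity.
            new_examples.append(teacher_similar_question(problem))
    # Multi-round paradigm: the enriched set seeds the next distillation round.
    return dataset + new_examples
```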
Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5
Lamott, Marcel, Shakir, Muhammad Armaghan
The surge of digital documents in various formats, including less standardized documents such as business reports and environmental assessments, underscores the growing importance of Document Understanding. While Large Language Models (LLMs) have showcased prowess across diverse natural language processing tasks, their direct application to Document Understanding remains a challenge. Previous research has demonstrated the utility of LLMs in this domain, yet their significant computational demands make them challenging to deploy effectively. Additionally, proprietary Blackbox LLMs often outperform their open-source counterparts, posing a barrier to widespread accessibility. In this paper, we delve into the realm of document understanding, leveraging distillation methods to harness the power of large LLMs while accommodating computational limitations. Specifically, we present a novel approach wherein we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms to facilitate efficient knowledge transfer. This work contributes to the advancement of document understanding methodologies by offering a scalable solution that bridges the gap between resource-intensive LLMs and practical applications. Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios, thereby fostering advancements in natural language processing and document comprehension domains.
- Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Italy > Piedmont > Turin Province > Turin (0.04)
- (2 more...)
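The abstract above only names its curriculum-learning mechanism, so the following is a speculative sketch: order teacher-labeled examples from easy to hard by a simple proxy (document length here, an assumption) and expose them to fine-tuning in progressively larger slices.

```python
# Speculative curriculum-ordering sketch; the "document" field and length-based
# difficulty proxy are assumptions, not details from the paper.
def curriculum_order(examples, difficulty=lambda ex: len(ex["document"])):
    """examples: list of dicts carrying teacher (ChatGPT) labels; easiest first."""
    return sorted(examples, key=difficulty)

def curriculum_slices(examples, n_stages=3):
    """Train on the easiest slice first, then progressively include harder examples."""
    ordered = curriculum_order(examples)
    step = max(1, len(ordered) // n_stages)
    return [ordered[: (i + 1) * step] for i in range(n_stages)]
```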
Federated Learning with a Single Shared Image
Soni, Sunny, Saeed, Aaqib, Asano, Yuki M.
Federated Learning (FL) enables multiple machines to collaboratively train a machine learning model without sharing private training data. Yet, especially for heterogeneous models, a key bottleneck remains the transfer of knowledge gained by each client model to the server. One popular method, FedDF, uses distillation to tackle this task via a common, shared dataset on which predictions are exchanged. However, in many contexts such a dataset might be difficult to acquire due to privacy concerns, and clients might not allow storage of a large shared dataset. To this end, we introduce a new method that improves this knowledge distillation approach to rely on only a single image shared between clients and server. In particular, we propose a novel adaptive dataset pruning algorithm that selects the most informative crops generated from only a single image. With this, we show that federated learning with distillation under a limited shared-dataset budget works better with a single image than with multiple individual images. Finally, we extend our approach to allow for training heterogeneous client architectures by incorporating a non-uniform distillation schedule and client-model mirroring on the server side.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > Virginia (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
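A rough sketch of the single-image pruning idea above: sample many random crops from the one shared image and keep the crops on which the client ensemble is most uncertain. Ranking by predictive entropy is an assumed informativeness proxy, not necessarily the paper's criterion.

```python
# Hedged sketch: generate candidate crops from one PIL image and keep the most
# "informative" ones (highest ensemble entropy) for server-side distillation.
import torch
import torch.nn.functional as F
import torchvision.transforms as T

@torch.no_grad()
def select_informative_crops(image, client_models, n_candidates=512, keep=128, size=32):
    cropper = T.Compose([T.RandomResizedCrop(size, scale=(0.05, 0.5)), T.ToTensor()])
    crops = torch.stack([cropper(image) for _ in range(n_candidates)])

    # Average client predictions per crop and rank crops by predictive entropy.
    probs = torch.stack([F.softmax(m(crops), dim=-1) for m in client_models]).mean(0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    top = entropy.topk(keep).indices
    return crops[top]
```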